MPI Reduction Operations for Sparse Floating-point Data
Authors
Abstract
This paper presents a pipelined algorithm for MPI_Reduce that uses a run-length encoding (RLE) scheme to speed up the global reduction of sparse floating-point data. The RLE scheme is incorporated directly into the reduction process and incurs only low overhead, even in the worst case. Because the RLE scheme achieves high throughput, it improves performance even on high-performance interconnects. Random sample data and sparse vector data from a parallel FEM application are used to demonstrate the performance of the new reduction algorithm on an HPC cluster with InfiniBand interconnects.
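The abstract does not specify the RLE format, so the following is only a minimal sketch under a hypothetical scheme. Each run is a pair (number of zeros, block of nonzero values), and two encoded vectors are summed directly in encoded form, without decoding them first. That is the kind of combine step an RLE-aware reduction would apply at each stage. The names `rle_encode`, `rle_decode` and `rle_sum` are illustrative and not taken from the paper. MPI communication, buffer serialization, and the pipelining/segmentation of the paper's algorithm are deliberately left out.

```python
from functools import reduce
from typing import Iterator, List, Optional, Tuple

# Hypothetical encoding: each run is (number of zeros, following block of nonzero values).
Run = Tuple[int, List[float]]
Token = Tuple[str, float]  # ("z", zero_count) or ("v", value)


class _RunBuilder:
    """Incrementally appends zeros and values, merging adjacent zero runs."""

    def __init__(self) -> None:
        self.runs: List[Run] = []
        self.zeros = 0
        self.literals: List[float] = []

    def add_zeros(self, n: int) -> None:
        if n <= 0:
            return
        if self.literals:  # a zero ends the current block of nonzeros
            self.runs.append((self.zeros, self.literals))
            self.zeros, self.literals = 0, []
        self.zeros += n

    def add_value(self, v: float) -> None:
        if v == 0.0:  # e.g. 2.0 + (-2.0): keep the encoding canonical
            self.add_zeros(1)
        else:
            self.literals.append(v)

    def finish(self) -> List[Run]:
        if self.zeros or self.literals:
            self.runs.append((self.zeros, self.literals))
        return self.runs


def rle_encode(values: List[float]) -> List[Run]:
    b = _RunBuilder()
    for v in values:
        b.add_value(v)
    return b.finish()


def rle_decode(runs: List[Run]) -> List[float]:
    out: List[float] = []
    for zeros, literals in runs:
        out.extend([0.0] * zeros)
        out.extend(literals)
    return out


def _tokens(runs: List[Run]) -> Iterator[Token]:
    for zeros, literals in runs:
        if zeros:
            yield ("z", zeros)
        for v in literals:
            yield ("v", v)


def _consume(tok: Token, it: Iterator[Token], n: int) -> Optional[Token]:
    """Remove n elements from the front of tok; fetch the next token when exhausted."""
    kind, x = tok
    if kind == "z" and x > n:
        return ("z", x - n)
    return next(it, None)


def rle_sum(a: List[Run], b: List[Run]) -> List[Run]:
    """Element-wise a + b computed directly on encoded vectors of equal length."""
    out = _RunBuilder()
    ia, ib = _tokens(a), _tokens(b)
    ta, tb = next(ia, None), next(ib, None)
    while ta is not None and tb is not None:
        if ta[0] == "z" and tb[0] == "z":
            n = min(ta[1], tb[1])  # overlapping zero runs are skipped in one step
            out.add_zeros(n)
        else:
            n = 1  # at least one side holds a nonzero: add one element pair
            va = ta[1] if ta[0] == "v" else 0.0
            vb = tb[1] if tb[0] == "v" else 0.0
            out.add_value(va + vb)
        ta, tb = _consume(ta, ia, n), _consume(tb, ib, n)
    if ta is not None or tb is not None:
        raise ValueError("encoded vectors differ in length")
    return out.finish()


if __name__ == "__main__":
    # Three simulated ranks contribute sparse vectors; combine them pairwise.
    ranks = [[0.0, 0.0, 1.0, 0.0], [0.0, 4.0, 0.0, 2.0], [0.0, 0.0, -1.0, 0.0]]
    total = reduce(rle_sum, [rle_encode(v) for v in ranks])
    print(total, rle_decode(total))
```

Overlapping zero runs are consumed in a single step, which is where this format saves work on sparse data. For fully dense input, `rle_encode` yields a single run `(0, values)`, so only constant-size metadata is added. The least favorable pattern for this particular format is alternating zeros and nonzeros, where every value carries its own zero count. These are properties of this sketch, not of the paper's encoding.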